The Time feature carries no predictive value, so we can drop it from the dataset

Also, the Pass/Fail column can be modified slightly for better clarity

Percentage of missing values in all the features

There are features with a large number of missing values (up to 91%), which need to be handled
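The per-feature missing percentage can be computed in one line; here is a minimal sketch on a tiny synthetic frame (the column names are illustrative, not from the actual dataset):

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame standing in for the sensor data (column names are illustrative)
df = pd.DataFrame({
    "f1": [1.0, np.nan, np.nan, np.nan],  # 75% missing
    "f2": [1.0, 2.0, np.nan, 4.0],        # 25% missing
    "f3": [1.0, 2.0, 3.0, 4.0],           # complete
})

# Percentage of missing values per feature, worst first
missing_pct = df.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```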

Percentage of zeros in all the features

A large number of zeros is present; many features contain only a single value, 0, throughout

More than 250 features have extremely low variance (&lt;0.1) and thus contribute minimally to the output

Drop features with high missing values and low variance
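A sketch of this dropping step, assuming (as an illustration) a 50% missing-value cutoff alongside the variance &lt; 0.1 cutoff stated above:

```python
import numpy as np
import pandas as pd

# Illustrative frame: one mostly-missing feature, one constant feature, one useful one
df = pd.DataFrame({
    "mostly_nan": [np.nan, np.nan, np.nan, 1.0],
    "constant":   [0.0, 0.0, 0.0, 0.0],
    "useful":     [1.0, 5.0, 2.0, 9.0],
})

# Columns failing either criterion (assumed cutoffs: >50% missing, variance < 0.1)
high_missing = df.columns[df.isna().mean() > 0.5]
low_variance = df.columns[df.var() < 0.1]

df_clean = df.drop(columns=high_missing.union(low_variance))
print(df_clean.columns.tolist())
```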

Checking for multicollinearity

There are several highly multicollinear features (high VIF values). Generally, VIF &gt; 10 is considered high. Let's remove these features as well

Drop features with high multicollinearity
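One way to sketch the VIF check without statsmodels: for standardized features, the VIFs equal the diagonal of the inverse correlation matrix. The data below is synthetic, constructed so that two features are nearly collinear:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)                    # independent
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# VIF of each feature = corresponding diagonal entry of the inverse correlation matrix
corr = X.corr().to_numpy()
vif = pd.Series(np.diag(np.linalg.inv(corr)), index=X.columns)
print(vif)

# Features exceeding the usual VIF > 10 threshold would be dropped
to_drop = vif[vif > 10].index
```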

Percentage of missing values (NaN) in all the features

These two are the same features that had a high proportion of zeros, as shown in the previous Plotly graph

These two features offer no value in predicting the target column.

Let's also check whether any other feature is dominated by a value other than zero

Thus, no feature other than these two is heavily dominated by a single value
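A minimal sketch of such a dominance check: compute the share of the single most frequent value in each feature and flag anything above an assumed 90% threshold (the frame and threshold are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "all_zero": [0, 0, 0, 0, 0],
    "mostly_7": [7, 7, 7, 7, 1],
    "varied":   [1, 2, 3, 4, 5],
})

# Share of the single most frequent value in each feature
top_share = df.apply(lambda s: s.value_counts(normalize=True).iloc[0])

# Flag features dominated by one value (assumed cutoff: > 90%)
dominated = top_share[top_share > 0.9].index
print(top_share)
```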

Drop the 2 features that are not adding any value

Checking for skewness

Generally, skewness greater than +1 or less than -1 is considered high. Here the skewness is extreme: the distributions of many features are highly non-normal and are expected to contain extreme outliers, which could hurt the prediction accuracy of many classifiers
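Pandas computes per-feature skewness directly. A small synthetic illustration with one roughly normal feature and one heavy-tailed feature:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "normal_ish":   rng.normal(size=500),                 # skewness near 0
    "right_skewed": rng.lognormal(sigma=1.5, size=500),   # heavy right tail
})

# Per-feature skewness; |skew| > 1 flagged as highly skewed
skew = df.skew()
highly_skewed = skew[skew.abs() > 1].index
print(skew)
```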

The skewness may be caused by potential outliers.

Percentage of Outliers in all the features

There are too many IQR outliers to remove. Removing them could change the nature of the data, given the small size of the 'Fail' class. Thus it might be better to use a different strategy.

Let us use quantile transformation to handle Outliers
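A sketch of the idea with scikit-learn's `QuantileTransformer`: because the transform depends only on ranks, extreme values are pulled into a bounded range instead of being deleted. The heavy-tailed input below is synthetic:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(2)
X = rng.lognormal(sigma=2.0, size=(300, 1))  # heavy-tailed, with extreme outliers

# Map each feature to an approximately normal distribution;
# only ranks matter, so outliers lose their leverage
qt = QuantileTransformer(output_distribution="normal", n_quantiles=300, random_state=0)
X_t = qt.fit_transform(X)
```

Note that `n_quantiles` must not exceed the number of samples, which is why it is set to 300 here.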

All the outliers have been handled

Imputation (with 0) on the main data
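Zero-imputation itself is a one-liner; a minimal sketch on an illustrative frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 0.5, np.nan]})

# Replace every remaining NaN with 0
df_imputed = df.fillna(0)
print(df_imputed)
```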

The dataset is highly imbalanced

Model Building

PCA

Covariance matrix

Finding optimum value for n_components

Sorting the eigenvalues in descending order

Variance captured calculation

105 components capture about 95% of the variance
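The steps above (covariance matrix, sorted eigenvalues, cumulative variance, choice of n_components) can be sketched as follows; the matrix here is synthetic, so the resulting component count is illustrative, not the 105 found on the real data:

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic centered data standing in for the cleaned feature matrix
X = rng.normal(size=(400, 20))
X = X - X.mean(axis=0)

# Covariance matrix and its eigenvalues, sorted in descending order
cov = np.cov(X, rowvar=False)
eigvals = np.linalg.eigvalsh(cov)[::-1]

# Cumulative fraction of variance captured by the top-k components
explained = np.cumsum(eigvals) / eigvals.sum()

# Smallest k reaching 95% of the variance
n_components = int(np.searchsorted(explained, 0.95) + 1)
print(n_components)
```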

Train test split
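Given the severe class imbalance noted later, a stratified split keeps the rare 'Fail' class at the same proportion in both partitions. A sketch on synthetic labels at roughly the dataset's 14:1 ratio:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(150, 5))
y = np.array([0] * 140 + [1] * 10)   # imbalanced labels, 14:1

# stratify=y preserves the class ratio in train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```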

1.1 SVM Classifier with PCA

1.2 SVM Classifier without PCA

2.1 XGBoost Classifier with PCA

The XGBoost classifier with PCA overfits

2.2 XGBoost Classifier without PCA

The XGBoost classifier without PCA also overfits

3.1 Logistic Regression with PCA

3.2 Logistic Regression without PCA and scaling

Logistic regression without pca and scaling overfits

Selecting the final best model

Logistic regression with PCA and the SVM classifier without PCA have equal, and the highest, f1-scores for class 1 (the minority class) on the test data compared to the other models, so we could choose either as our final model.

When we consider the cross-validation f1-score, logistic regression with PCA clearly performs well on unseen data. It also has the minimum standard deviation in CV f1-score. Therefore, I am choosing logistic regression with PCA as my final model.

Pickling the final model
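Persisting and reloading the model is a short `pickle` round trip. The model below is a stand-in trained on synthetic data; in the notebook it would be the tuned PCA + logistic regression pipeline:

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in for the final fitted model (synthetic data, illustrative only)
X, y = make_classification(n_samples=100, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Serialize to disk, then load it back to verify the round trip
with open("final_model.pkl", "wb") as f:
    pickle.dump(model, f)

with open("final_model.pkl", "rb") as f:
    loaded = pickle.load(f)
```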

Import the future data file, use it to make predictions with the best model chosen above, and display the prediction results

Prediction of future data

Conclusion

The dataset contains a lot of missing values; some features had about 91% missing. During future data collection, please try to collect data with as few missing values as possible. The dataset is also highly imbalanced (imbalance ratio 14:1); try to collect data without such imbalance. Because of this imbalance, the model I have built tends to perform better on majority-class records. The XGBoost classifier overfits the training data, and SVC consumes more time as it is compute-intensive. I have chosen logistic regression because it handles the data imbalance comparatively well and is less compute-intensive.

In this project I have implemented pipelines, hyperparameter tuning, cross-validation, and dimensionality reduction techniques.